Exploratory data analysis (EDA) is the first step: an initial examination of the dataset before any modelling.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from natsort import index_natsorted
Initially, we import the fundamental libraries needed for EDA and the ML algorithms.
data_icu = pd.read_csv('Kaggle_Sirio_Libanes_ICU_Prediction.csv')
We use pandas' read_csv to load the .csv file and store the dataset as data_icu.
data_icu.head()
| PATIENT_VISIT_IDENTIFIER | AGE_ABOVE65 | AGE_PERCENTIL | GENDER | DISEASE GROUPING 1 | DISEASE GROUPING 2 | DISEASE GROUPING 3 | DISEASE GROUPING 4 | DISEASE GROUPING 5 | DISEASE GROUPING 6 | ... | TEMPERATURE_DIFF | OXYGEN_SATURATION_DIFF | BLOODPRESSURE_DIASTOLIC_DIFF_REL | BLOODPRESSURE_SISTOLIC_DIFF_REL | HEART_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | TEMPERATURE_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | WINDOW | ICU | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 0-2 | 0 |
| 1 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 2-4 | 0 |
| 2 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4-6 | 0 |
| 3 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -1.000000 | -1.000000 | NaN | NaN | NaN | NaN | -1.000000 | -1.000000 | 6-12 | 0 |
| 4 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -0.238095 | -0.818182 | -0.389967 | 0.407558 | -0.230462 | 0.096774 | -0.242282 | -0.814433 | ABOVE_12 | 1 |
5 rows × 231 columns
The first five rows of data_icu are displayed using the .head() method.
data_icu.tail()
| PATIENT_VISIT_IDENTIFIER | AGE_ABOVE65 | AGE_PERCENTIL | GENDER | DISEASE GROUPING 1 | DISEASE GROUPING 2 | DISEASE GROUPING 3 | DISEASE GROUPING 4 | DISEASE GROUPING 5 | DISEASE GROUPING 6 | ... | TEMPERATURE_DIFF | OXYGEN_SATURATION_DIFF | BLOODPRESSURE_DIASTOLIC_DIFF_REL | BLOODPRESSURE_SISTOLIC_DIFF_REL | HEART_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | TEMPERATURE_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | WINDOW | ICU | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1920 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 0-2 | 0 |
| 1921 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 2-4 | 0 |
| 1922 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 4-6 | 0 |
| 1923 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 6-12 | 0 |
| 1924 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | -0.547619 | -0.838384 | -0.701863 | -0.585967 | -0.763868 | -0.612903 | -0.551337 | -0.835052 | ABOVE_12 | 0 |
5 rows × 231 columns
The last five rows of data_icu are displayed using the .tail() method.
data_icu.shape
(1925, 231)
.shape returns the dimensions of the dataset as (rows, columns): here (1925, 231).
def age_perc_to_int(percentil):
    # "Above 90th" has no upper bound, so map it to 100.
    if percentil == "Above 90th":
        return 100
    # Otherwise keep only the digits, e.g. "60th" -> 60.
    return int("".join(c for c in str(percentil) if c.isdigit()))

data_icu["AGE_PERCENTIL"] = data_icu["AGE_PERCENTIL"].apply(age_perc_to_int)
set(data_icu["AGE_PERCENTIL"].values)
{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
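The same conversion can also be done in a vectorized way with pandas string methods; a minimal sketch on hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample mirroring the raw AGE_PERCENTIL values.
s = pd.Series(["10th", "60th", "Above 90th"])

# Pull out the digits from each label, then map the open-ended
# "Above 90th" bucket to 100, matching age_perc_to_int above.
out = s.str.extract(r"(\d+)", expand=False).astype(int)
out = out.where(s != "Above 90th", 100)
print(out.tolist())
```

This avoids a Python-level function call per row, which matters little at 1,925 rows but scales better on larger frames.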
def cat_window(window):
    # "ABOVE_12" is open-ended, so encode it as 13.
    if window == "ABOVE_12":
        return 13
    # Otherwise take the upper bound of the window, e.g. "2-4" -> 4.
    return int(window.split("-")[1])

data_icu['WINDOW'] = data_icu['WINDOW'].apply(cat_window)
data_icu['WINDOW'].isnull().sum()
0
Two columns were stored as objects (AGE_PERCENTIL and WINDOW). To avoid numerical issues downstream, the functions above convert these string categories into integers.
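To check which columns still need this treatment, the object-dtype columns can be listed with select_dtypes; a small sketch on a toy frame mimicking the WINDOW column:

```python
import pandas as pd

# Toy frame with one string-typed column, following the same
# pattern as WINDOW in the real dataset.
df = pd.DataFrame({"WINDOW": ["0-2", "2-4", "ABOVE_12"],
                   "ICU": [0, 0, 1]})

# Object-dtype columns are the ones that still need numeric encoding.
obj_cols = df.select_dtypes(include="object").columns.tolist()
print(obj_cols)
```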
data_icu.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1925 entries, 0 to 1924 Columns: 231 entries, PATIENT_VISIT_IDENTIFIER to ICU dtypes: float64(225), int64(6) memory usage: 3.4 MB
The info() method shows each column's dtype (int64 or float64 here) along with the memory usage, confirming that no object columns remain.
data_icu.describe()
| PATIENT_VISIT_IDENTIFIER | AGE_ABOVE65 | AGE_PERCENTIL | GENDER | DISEASE GROUPING 1 | DISEASE GROUPING 2 | DISEASE GROUPING 3 | DISEASE GROUPING 4 | DISEASE GROUPING 5 | DISEASE GROUPING 6 | ... | TEMPERATURE_DIFF | OXYGEN_SATURATION_DIFF | BLOODPRESSURE_DIASTOLIC_DIFF_REL | BLOODPRESSURE_SISTOLIC_DIFF_REL | HEART_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | TEMPERATURE_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | WINDOW | ICU | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1920.000000 | 1920.000000 | 1920.000000 | 1920.000000 | 1920.000000 | 1920.000000 | ... | 1231.000000 | 1239.000000 | 1240.000000 | 1240.000000 | 1240.000000 | 1177.000000 | 1231.000000 | 1239.000000 | 1925.000000 | 1925.000000 |
| mean | 192.000000 | 0.467532 | 53.194805 | 0.368831 | 0.108333 | 0.028125 | 0.097917 | 0.019792 | 0.128125 | 0.046875 | ... | -0.770338 | -0.887196 | -0.786997 | -0.715950 | -0.817800 | -0.719147 | -0.771327 | -0.886982 | 7.400000 | 0.267532 |
| std | 111.168431 | 0.499074 | 28.673479 | 0.482613 | 0.310882 | 0.165373 | 0.297279 | 0.139320 | 0.334316 | 0.211426 | ... | 0.319001 | 0.296147 | 0.324754 | 0.419103 | 0.270217 | 0.446600 | 0.317694 | 0.296772 | 4.364619 | 0.442787 |
| min | 0.000000 | 0.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 2.000000 | 0.000000 |
| 25% | 96.000000 | 0.000000 | 30.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 4.000000 | 0.000000 |
| 50% | 192.000000 | 0.000000 | 50.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | -0.976190 | -0.979798 | -1.000000 | -0.984944 | -0.989822 | -1.000000 | -0.975924 | -0.980333 | 6.000000 | 0.000000 |
| 75% | 288.000000 | 1.000000 | 80.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | -0.595238 | -0.878788 | -0.645482 | -0.522176 | -0.662529 | -0.634409 | -0.594677 | -0.880155 | 12.000000 | 1.000000 |
| max | 384.000000 | 1.000000 | 100.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 13.000000 | 1.000000 |
8 rows × 231 columns
describe() displays summary statistics (count, mean, std, quartiles, min, max) for each numeric column of data_icu.
data_icu.count()
PATIENT_VISIT_IDENTIFIER 1925
AGE_ABOVE65 1925
AGE_PERCENTIL 1925
GENDER 1925
DISEASE GROUPING 1 1920
...
RESPIRATORY_RATE_DIFF_REL 1177
TEMPERATURE_DIFF_REL 1231
OXYGEN_SATURATION_DIFF_REL 1239
WINDOW 1925
ICU 1925
Length: 231, dtype: int64
The .count() method returns the number of non-null values in each column of data_icu.
data_icu.nunique()
PATIENT_VISIT_IDENTIFIER 385
AGE_ABOVE65 2
AGE_PERCENTIL 10
GENDER 2
DISEASE GROUPING 1 2
...
RESPIRATORY_RATE_DIFF_REL 200
TEMPERATURE_DIFF_REL 457
OXYGEN_SATURATION_DIFF_REL 187
WINDOW 5
ICU 2
Length: 231, dtype: int64
The nunique() function returns the number of unique values in each column.
data_icu.columns
Index(['PATIENT_VISIT_IDENTIFIER', 'AGE_ABOVE65', 'AGE_PERCENTIL', 'GENDER',
'DISEASE GROUPING 1', 'DISEASE GROUPING 2', 'DISEASE GROUPING 3',
'DISEASE GROUPING 4', 'DISEASE GROUPING 5', 'DISEASE GROUPING 6',
...
'TEMPERATURE_DIFF', 'OXYGEN_SATURATION_DIFF',
'BLOODPRESSURE_DIASTOLIC_DIFF_REL', 'BLOODPRESSURE_SISTOLIC_DIFF_REL',
'HEART_RATE_DIFF_REL', 'RESPIRATORY_RATE_DIFF_REL',
'TEMPERATURE_DIFF_REL', 'OXYGEN_SATURATION_DIFF_REL', 'WINDOW', 'ICU'],
dtype='object', length=231)
We list the column names of the dataset using the .columns attribute.
data_icu.drop_duplicates()
data_icu.shape
(1925, 231)
data_icu.duplicated(subset=None, keep="first")
0 False
1 False
2 False
3 False
4 False
...
1920 False
1921 False
1922 False
1923 False
1924 False
Length: 1925, dtype: bool
The .drop_duplicates() method removes duplicate rows; the unchanged shape and the all-False .duplicated() output show that our dataset contains no duplicate rows.
data_icu = data_icu.drop_duplicates(keep='first')
# Drop columns that are entirely null
data = data_icu.dropna(axis=1,how="all")
data_icu.shape
(1925, 231)
.dropna(axis=1, how="all") drops columns in which every value is null. The unchanged shape shows that no column in our dataset is entirely null at this stage.
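A toy example of this behaviour, showing that dropna(axis=1, how="all") removes only columns that are entirely NaN:

```python
import numpy as np
import pandas as pd

# Column "b" is fully NaN; column "c" is only partially NaN.
df = pd.DataFrame({"a": [1, 2],
                   "b": [np.nan, np.nan],
                   "c": [3, np.nan]})

# how="all" drops a column only when every value in it is null.
cleaned = df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```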
data_icu.isnull().sum()
PATIENT_VISIT_IDENTIFIER 0
AGE_ABOVE65 0
AGE_PERCENTIL 0
GENDER 0
DISEASE GROUPING 1 5
...
RESPIRATORY_RATE_DIFF_REL 748
TEMPERATURE_DIFF_REL 694
OXYGEN_SATURATION_DIFF_REL 686
WINDOW 0
ICU 0
Length: 231, dtype: int64
The .isnull() function flags null values; chained with .sum(), it tabulates the null count per column, as shown above. Several variables contain null values.
def _impute_missing_data(data_icu):
    # The dataset encodes missing measurements as -1; turn them into NaN.
    return data_icu.replace(-1, np.nan)

data_icu = _impute_missing_data(data_icu)
print('NaN values = ', data_icu.isnull().sum().sum())
print()

vars_with_missing = []
for feature in data_icu.columns:
    missings = data_icu[feature].isna().sum()
    if missings > 0:
        vars_with_missing.append(feature)
        missings_perc = missings / data_icu.shape[0]
        print('Variable {} has {} records ({:.2%}) with missing values.'.format(feature, missings, missings_perc))
print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
NaN values = 269697 Variable DISEASE GROUPING 1 has5 records (0.26%) with missing values. Variable DISEASE GROUPING 2 has5 records (0.26%) with missing values. Variable DISEASE GROUPING 3 has5 records (0.26%) with missing values. Variable DISEASE GROUPING 4 has5 records (0.26%) with missing values. Variable DISEASE GROUPING 5 has5 records (0.26%) with missing values. Variable DISEASE GROUPING 6 has5 records (0.26%) with missing values. Variable HTN has5 records (0.26%) with missing values. Variable IMMUNOCOMPROMISED has5 records (0.26%) with missing values. Variable OTHER has5 records (0.26%) with missing values. Variable ALBUMIN_MEDIAN has1105 records (57.40%) with missing values. Variable ALBUMIN_MEAN has1105 records (57.40%) with missing values. Variable ALBUMIN_MIN has1105 records (57.40%) with missing values. Variable ALBUMIN_MAX has1105 records (57.40%) with missing values. Variable ALBUMIN_DIFF has1925 records (100.00%) with missing values. Variable BE_ARTERIAL_MEDIAN has1850 records (96.10%) with missing values. Variable BE_ARTERIAL_MEAN has1850 records (96.10%) with missing values. Variable BE_ARTERIAL_MIN has1850 records (96.10%) with missing values. Variable BE_ARTERIAL_MAX has1850 records (96.10%) with missing values. Variable BE_ARTERIAL_DIFF has1925 records (100.00%) with missing values. Variable BE_VENOUS_MEDIAN has1687 records (87.64%) with missing values. Variable BE_VENOUS_MEAN has1687 records (87.64%) with missing values. Variable BE_VENOUS_MIN has1687 records (87.64%) with missing values. Variable BE_VENOUS_MAX has1687 records (87.64%) with missing values. Variable BE_VENOUS_DIFF has1925 records (100.00%) with missing values. Variable BIC_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values. Variable BIC_ARTERIAL_MEAN has1105 records (57.40%) with missing values. Variable BIC_ARTERIAL_MIN has1105 records (57.40%) with missing values. Variable BIC_ARTERIAL_MAX has1105 records (57.40%) with missing values. 
Variable BIC_ARTERIAL_DIFF has1925 records (100.00%) with missing values. Variable BIC_VENOUS_MEDIAN has1105 records (57.40%) with missing values. Variable BIC_VENOUS_MEAN has1105 records (57.40%) with missing values. Variable BIC_VENOUS_MIN has1105 records (57.40%) with missing values. Variable BIC_VENOUS_MAX has1105 records (57.40%) with missing values. Variable BIC_VENOUS_DIFF has1925 records (100.00%) with missing values. Variable BILLIRUBIN_MEDIAN has1105 records (57.40%) with missing values. Variable BILLIRUBIN_MEAN has1105 records (57.40%) with missing values. Variable BILLIRUBIN_MIN has1105 records (57.40%) with missing values. Variable BILLIRUBIN_MAX has1105 records (57.40%) with missing values. Variable BILLIRUBIN_DIFF has1925 records (100.00%) with missing values. Variable BLAST_MEDIAN has1918 records (99.64%) with missing values. Variable BLAST_MEAN has1918 records (99.64%) with missing values. Variable BLAST_MIN has1918 records (99.64%) with missing values. Variable BLAST_MAX has1918 records (99.64%) with missing values. Variable BLAST_DIFF has1925 records (100.00%) with missing values. Variable CALCIUM_MEDIAN has1105 records (57.40%) with missing values. Variable CALCIUM_MEAN has1105 records (57.40%) with missing values. Variable CALCIUM_MIN has1105 records (57.40%) with missing values. Variable CALCIUM_MAX has1105 records (57.40%) with missing values. Variable CALCIUM_DIFF has1925 records (100.00%) with missing values. Variable CREATININ_MEDIAN has1105 records (57.40%) with missing values. Variable CREATININ_MEAN has1105 records (57.40%) with missing values. Variable CREATININ_MIN has1105 records (57.40%) with missing values. Variable CREATININ_MAX has1105 records (57.40%) with missing values. Variable CREATININ_DIFF has1925 records (100.00%) with missing values. Variable FFA_MEDIAN has1105 records (57.40%) with missing values. Variable FFA_MEAN has1105 records (57.40%) with missing values. 
Variable FFA_MIN has1105 records (57.40%) with missing values. Variable FFA_MAX has1105 records (57.40%) with missing values. Variable FFA_DIFF has1925 records (100.00%) with missing values. Variable GGT_MEDIAN has1105 records (57.40%) with missing values. Variable GGT_MEAN has1105 records (57.40%) with missing values. Variable GGT_MIN has1105 records (57.40%) with missing values. Variable GGT_MAX has1105 records (57.40%) with missing values. Variable GGT_DIFF has1925 records (100.00%) with missing values. Variable GLUCOSE_MEDIAN has1105 records (57.40%) with missing values. Variable GLUCOSE_MEAN has1105 records (57.40%) with missing values. Variable GLUCOSE_MIN has1105 records (57.40%) with missing values. Variable GLUCOSE_MAX has1105 records (57.40%) with missing values. Variable GLUCOSE_DIFF has1925 records (100.00%) with missing values. Variable HEMATOCRITE_MEDIAN has1105 records (57.40%) with missing values. Variable HEMATOCRITE_MEAN has1105 records (57.40%) with missing values. Variable HEMATOCRITE_MIN has1105 records (57.40%) with missing values. Variable HEMATOCRITE_MAX has1105 records (57.40%) with missing values. Variable HEMATOCRITE_DIFF has1925 records (100.00%) with missing values. Variable HEMOGLOBIN_MEDIAN has1105 records (57.40%) with missing values. Variable HEMOGLOBIN_MEAN has1105 records (57.40%) with missing values. Variable HEMOGLOBIN_MIN has1105 records (57.40%) with missing values. Variable HEMOGLOBIN_MAX has1105 records (57.40%) with missing values. Variable HEMOGLOBIN_DIFF has1925 records (100.00%) with missing values. Variable INR_MEDIAN has1105 records (57.40%) with missing values. Variable INR_MEAN has1105 records (57.40%) with missing values. Variable INR_MIN has1105 records (57.40%) with missing values. Variable INR_MAX has1105 records (57.40%) with missing values. Variable INR_DIFF has1925 records (100.00%) with missing values. Variable LACTATE_MEDIAN has1105 records (57.40%) with missing values. 
Variable LACTATE_MEAN has1105 records (57.40%) with missing values. Variable LACTATE_MIN has1105 records (57.40%) with missing values. Variable LACTATE_MAX has1105 records (57.40%) with missing values. Variable LACTATE_DIFF has1925 records (100.00%) with missing values. Variable LEUKOCYTES_MEDIAN has1105 records (57.40%) with missing values. Variable LEUKOCYTES_MEAN has1105 records (57.40%) with missing values. Variable LEUKOCYTES_MIN has1105 records (57.40%) with missing values. Variable LEUKOCYTES_MAX has1105 records (57.40%) with missing values. Variable LEUKOCYTES_DIFF has1925 records (100.00%) with missing values. Variable LINFOCITOS_MEDIAN has1105 records (57.40%) with missing values. Variable LINFOCITOS_MEAN has1105 records (57.40%) with missing values. Variable LINFOCITOS_MIN has1105 records (57.40%) with missing values. Variable LINFOCITOS_MAX has1105 records (57.40%) with missing values. Variable LINFOCITOS_DIFF has1925 records (100.00%) with missing values. Variable NEUTROPHILES_MEDIAN has1105 records (57.40%) with missing values. Variable NEUTROPHILES_MEAN has1105 records (57.40%) with missing values. Variable NEUTROPHILES_MIN has1105 records (57.40%) with missing values. Variable NEUTROPHILES_MAX has1105 records (57.40%) with missing values. Variable NEUTROPHILES_DIFF has1925 records (100.00%) with missing values. Variable P02_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values. Variable P02_ARTERIAL_MEAN has1105 records (57.40%) with missing values. Variable P02_ARTERIAL_MIN has1105 records (57.40%) with missing values. Variable P02_ARTERIAL_MAX has1105 records (57.40%) with missing values. Variable P02_ARTERIAL_DIFF has1925 records (100.00%) with missing values. Variable P02_VENOUS_MEDIAN has1105 records (57.40%) with missing values. Variable P02_VENOUS_MEAN has1105 records (57.40%) with missing values. Variable P02_VENOUS_MIN has1105 records (57.40%) with missing values. Variable P02_VENOUS_MAX has1105 records (57.40%) with missing values. 
Variable P02_VENOUS_DIFF has1925 records (100.00%) with missing values. Variable PC02_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values. Variable PC02_ARTERIAL_MEAN has1105 records (57.40%) with missing values. Variable PC02_ARTERIAL_MIN has1105 records (57.40%) with missing values. Variable PC02_ARTERIAL_MAX has1105 records (57.40%) with missing values. Variable PC02_ARTERIAL_DIFF has1925 records (100.00%) with missing values. Variable PC02_VENOUS_MEDIAN has1106 records (57.45%) with missing values. Variable PC02_VENOUS_MEAN has1106 records (57.45%) with missing values. Variable PC02_VENOUS_MIN has1106 records (57.45%) with missing values. Variable PC02_VENOUS_MAX has1106 records (57.45%) with missing values. Variable PC02_VENOUS_DIFF has1925 records (100.00%) with missing values. Variable PCR_MEDIAN has1113 records (57.82%) with missing values. Variable PCR_MEAN has1113 records (57.82%) with missing values. Variable PCR_MIN has1113 records (57.82%) with missing values. Variable PCR_MAX has1113 records (57.82%) with missing values. Variable PCR_DIFF has1925 records (100.00%) with missing values. Variable PH_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values. Variable PH_ARTERIAL_MEAN has1105 records (57.40%) with missing values. Variable PH_ARTERIAL_MIN has1105 records (57.40%) with missing values. Variable PH_ARTERIAL_MAX has1105 records (57.40%) with missing values. Variable PH_ARTERIAL_DIFF has1925 records (100.00%) with missing values. Variable PH_VENOUS_MEDIAN has1105 records (57.40%) with missing values. Variable PH_VENOUS_MEAN has1105 records (57.40%) with missing values. Variable PH_VENOUS_MIN has1105 records (57.40%) with missing values. Variable PH_VENOUS_MAX has1105 records (57.40%) with missing values. Variable PH_VENOUS_DIFF has1925 records (100.00%) with missing values. Variable PLATELETS_MEDIAN has1105 records (57.40%) with missing values. Variable PLATELETS_MEAN has1105 records (57.40%) with missing values. 
Variable PLATELETS_MIN has1105 records (57.40%) with missing values. Variable PLATELETS_MAX has1105 records (57.40%) with missing values. Variable PLATELETS_DIFF has1925 records (100.00%) with missing values. Variable POTASSIUM_MEDIAN has1106 records (57.45%) with missing values. Variable POTASSIUM_MEAN has1106 records (57.45%) with missing values. Variable POTASSIUM_MIN has1106 records (57.45%) with missing values. Variable POTASSIUM_MAX has1106 records (57.45%) with missing values. Variable POTASSIUM_DIFF has1925 records (100.00%) with missing values. Variable SAT02_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values. Variable SAT02_ARTERIAL_MEAN has1105 records (57.40%) with missing values. Variable SAT02_ARTERIAL_MIN has1105 records (57.40%) with missing values. Variable SAT02_ARTERIAL_MAX has1105 records (57.40%) with missing values. Variable SAT02_ARTERIAL_DIFF has1925 records (100.00%) with missing values. Variable SAT02_VENOUS_MEDIAN has1105 records (57.40%) with missing values. Variable SAT02_VENOUS_MEAN has1105 records (57.40%) with missing values. Variable SAT02_VENOUS_MIN has1105 records (57.40%) with missing values. Variable SAT02_VENOUS_MAX has1105 records (57.40%) with missing values. Variable SAT02_VENOUS_DIFF has1925 records (100.00%) with missing values. Variable SODIUM_MEDIAN has1105 records (57.40%) with missing values. Variable SODIUM_MEAN has1105 records (57.40%) with missing values. Variable SODIUM_MIN has1105 records (57.40%) with missing values. Variable SODIUM_MAX has1105 records (57.40%) with missing values. Variable SODIUM_DIFF has1925 records (100.00%) with missing values. Variable TGO_MEDIAN has1106 records (57.45%) with missing values. Variable TGO_MEAN has1106 records (57.45%) with missing values. Variable TGO_MIN has1106 records (57.45%) with missing values. Variable TGO_MAX has1106 records (57.45%) with missing values. Variable TGO_DIFF has1925 records (100.00%) with missing values. 
Variable TGP_MEDIAN has1105 records (57.40%) with missing values. Variable TGP_MEAN has1105 records (57.40%) with missing values. Variable TGP_MIN has1105 records (57.40%) with missing values. Variable TGP_MAX has1105 records (57.40%) with missing values. Variable TGP_DIFF has1925 records (100.00%) with missing values. Variable TTPA_MEDIAN has1105 records (57.40%) with missing values. Variable TTPA_MEAN has1105 records (57.40%) with missing values. Variable TTPA_MIN has1105 records (57.40%) with missing values. Variable TTPA_MAX has1105 records (57.40%) with missing values. Variable TTPA_DIFF has1925 records (100.00%) with missing values. Variable UREA_MEDIAN has1105 records (57.40%) with missing values. Variable UREA_MEAN has1105 records (57.40%) with missing values. Variable UREA_MIN has1105 records (57.40%) with missing values. Variable UREA_MAX has1105 records (57.40%) with missing values. Variable UREA_DIFF has1925 records (100.00%) with missing values. Variable DIMER_MEDIAN has1138 records (59.12%) with missing values. Variable DIMER_MEAN has1138 records (59.12%) with missing values. Variable DIMER_MIN has1138 records (59.12%) with missing values. Variable DIMER_MAX has1138 records (59.12%) with missing values. Variable DIMER_DIFF has1925 records (100.00%) with missing values. Variable BLOODPRESSURE_DIASTOLIC_MEAN has686 records (35.64%) with missing values. Variable BLOODPRESSURE_SISTOLIC_MEAN has687 records (35.69%) with missing values. Variable HEART_RATE_MEAN has686 records (35.64%) with missing values. Variable RESPIRATORY_RATE_MEAN has749 records (38.91%) with missing values. Variable TEMPERATURE_MEAN has695 records (36.10%) with missing values. Variable OXYGEN_SATURATION_MEAN has687 records (35.69%) with missing values. Variable BLOODPRESSURE_DIASTOLIC_MEDIAN has686 records (35.64%) with missing values. Variable BLOODPRESSURE_SISTOLIC_MEDIAN has689 records (35.79%) with missing values. 
Variable HEART_RATE_MEDIAN has686 records (35.64%) with missing values. Variable RESPIRATORY_RATE_MEDIAN has749 records (38.91%) with missing values. Variable TEMPERATURE_MEDIAN has695 records (36.10%) with missing values. Variable OXYGEN_SATURATION_MEDIAN has687 records (35.69%) with missing values. Variable BLOODPRESSURE_DIASTOLIC_MIN has686 records (35.64%) with missing values. Variable BLOODPRESSURE_SISTOLIC_MIN has689 records (35.79%) with missing values. Variable HEART_RATE_MIN has687 records (35.69%) with missing values. Variable RESPIRATORY_RATE_MIN has789 records (40.99%) with missing values. Variable TEMPERATURE_MIN has695 records (36.10%) with missing values. Variable OXYGEN_SATURATION_MIN has688 records (35.74%) with missing values. Variable BLOODPRESSURE_DIASTOLIC_MAX has686 records (35.64%) with missing values. Variable BLOODPRESSURE_SISTOLIC_MAX has687 records (35.69%) with missing values. Variable HEART_RATE_MAX has686 records (35.64%) with missing values. Variable RESPIRATORY_RATE_MAX has749 records (38.91%) with missing values. Variable TEMPERATURE_MAX has695 records (36.10%) with missing values. Variable OXYGEN_SATURATION_MAX has687 records (35.69%) with missing values. Variable BLOODPRESSURE_DIASTOLIC_DIFF has1309 records (68.00%) with missing values. Variable BLOODPRESSURE_SISTOLIC_DIFF has1300 records (67.53%) with missing values. Variable HEART_RATE_DIFF has1300 records (67.53%) with missing values. Variable RESPIRATORY_RATE_DIFF has1358 records (70.55%) with missing values. Variable TEMPERATURE_DIFF has1284 records (66.70%) with missing values. Variable OXYGEN_SATURATION_DIFF has1294 records (67.22%) with missing values. Variable BLOODPRESSURE_DIASTOLIC_DIFF_REL has1309 records (68.00%) with missing values. Variable BLOODPRESSURE_SISTOLIC_DIFF_REL has1300 records (67.53%) with missing values. Variable HEART_RATE_DIFF_REL has1300 records (67.53%) with missing values. 
Variable RESPIRATORY_RATE_DIFF_REL has1358 records (70.55%) with missing values. Variable TEMPERATURE_DIFF_REL has1284 records (66.70%) with missing values. Variable OXYGEN_SATURATION_DIFF_REL has1294 records (67.22%) with missing values. In total, there are 225 variables with missing values
The code above replaces the -1 sentinel with NaN, counts the total NaN values, and reports, per variable, how many records are missing.
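The same per-variable report can be computed more concisely with isna().mean(); a sketch on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Toy frame: half of "x" is missing, "y" is complete.
df = pd.DataFrame({"x": [1, np.nan, np.nan, 4],
                   "y": [1, 2, 3, 4]})

# isna().mean() gives the fraction of NaN per column in one pass.
missing_pct = df.isna().mean() * 100
vars_with_missing = missing_pct[missing_pct > 0].index.tolist()
print(missing_pct.to_dict(), vars_with_missing)
```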
pd.DataFrame({"Columns": data_icu.columns, "Missing_values": (data.isna().sum() / data_icu.shape[0]) * 100})
| Columns | Missing_values | |
|---|---|---|
| PATIENT_VISIT_IDENTIFIER | PATIENT_VISIT_IDENTIFIER | 0.000000 |
| AGE_ABOVE65 | AGE_ABOVE65 | 0.000000 |
| AGE_PERCENTIL | AGE_PERCENTIL | 0.000000 |
| GENDER | GENDER | 0.000000 |
| DISEASE GROUPING 1 | DISEASE GROUPING 1 | 0.259740 |
| ... | ... | ... |
| RESPIRATORY_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | 38.857143 |
| TEMPERATURE_DIFF_REL | TEMPERATURE_DIFF_REL | 36.051948 |
| OXYGEN_SATURATION_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | 35.636364 |
| WINDOW | WINDOW | 0.000000 |
| ICU | ICU | 0.000000 |
231 rows × 2 columns
The table above shows the percentage of missing values per column. Note that it is computed on data, the frame saved before the -1 to NaN replacement, so the percentages are lower than in the report above.
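A common follow-up is to drop columns above a missingness threshold; a sketch with a hypothetical 50% cutoff on toy data:

```python
import numpy as np
import pandas as pd

# "mostly_nan" is 75% missing; "keep" is complete.
df = pd.DataFrame({"keep": [1, 2, 3, 4],
                   "mostly_nan": [np.nan, np.nan, np.nan, 4.0]})

# Keep only columns whose missing fraction is at or below the cutoff.
threshold = 0.5
reduced = df.loc[:, df.isna().mean() <= threshold]
print(reduced.columns.tolist())
```

The 0.5 cutoff here is illustrative; the right threshold depends on how the remaining gaps will be imputed.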
import missingno as msno
msno.bar(data_icu)
<AxesSubplot:>
We import the missingno library to visualize the missing data; msno.bar plots the count of non-null values in each column.
msno.heatmap(data_icu)
<AxesSubplot:>
msno.heatmap draws a two-dimensional heat map of nullity correlation: each cell's hue shows how strongly the presence or absence of one variable coincides with that of another.
correlation = data_icu.corr()
sns.heatmap(correlation, xticklabels=correlation.columns, yticklabels=correlation.columns, annot=True)
<AxesSubplot:>
Because of the large number of columns and the many NaN values, the full correlation heatmap is unreadable; no pairwise correlations can be discerned at this scale.
pd.pivot_table(data_icu, index=['ICU', 'GENDER'], columns = ['AGE_ABOVE65'], aggfunc=len)
Since the pivot uses aggfunc=len, every one of the 456 value columns repeats the same row counts, so the table reduces to:

| ICU | GENDER | AGE_ABOVE65 = 0 | AGE_ABOVE65 = 1 |
|---|---|---|---|
| 0 | 0 | 527 | 336 |
| 0 | 1 | 314 | 233 |
| 1 | 0 | 143 | 209 |
| 1 | 1 | 41 | 122 |

4 rows × 456 columns
ICU: 0 - patient not admitted to the ICU (not critical); 1 - patient admitted to the ICU (critical condition due to Covid-19).
Outside the ICU, females appear less critical than males, while more females than males were admitted to the ICU.
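The same admission counts can be obtained more directly with pd.crosstab; a sketch on hypothetical 0/1-encoded labels:

```python
import pandas as pd

# Toy labels following the dataset's 0/1 encoding for ICU and GENDER.
df = pd.DataFrame({"ICU":    [0, 0, 1, 1, 1],
                   "GENDER": [0, 1, 0, 0, 1]})

# crosstab tabulates how many rows fall into each (ICU, GENDER) cell.
counts = pd.crosstab(df["ICU"], df["GENDER"])
print(counts)
```

Unlike pivot_table with aggfunc=len, crosstab produces a single count matrix rather than repeating the counts across every value column.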
The prefix "uni" means "one", so univariate analysis studies one variable at a time.
For univariate observation we use distplot, which plots a variable's data distribution against its density.
sns.distplot(data_icu['TEMPERATURE_DIFF'])
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
The temperature-difference density rises continuously to a peak of about 1.25 and then declines, with most of the mass between -0.5 and 0.
sns.distplot(data_icu['OXYGEN_SATURATION_DIFF'])
plt.show()
The oxygen-saturation-difference density lies mostly between -0.1 and 0.5 and reaches a peak of around 2.75.
sns.distplot(data_icu['BLOODPRESSURE_DIASTOLIC_DIFF_REL'])
plt.show()
The density of BLOODPRESSURE_DIASTOLIC_DIFF_REL reaches a peak of about 1.3.
sns.distplot(data_icu['BLOODPRESSURE_SISTOLIC_DIFF_REL'])
plt.show()
Here the density peaks at 1.2.
sns.distplot(data_icu['HEART_RATE_DIFF_REL'])
plt.show()
The maximum density for HEART_RATE_DIFF_REL is 1.5.
sns.histplot(data_icu['RESPIRATORY_RATE_DIFF_REL'], kde=True)
plt.show()
The maximum density for RESPIRATORY_RATE_DIFF_REL is 1.2.
sns.histplot(data_icu['TEMPERATURE_DIFF_REL'], kde=True)
plt.show()
The maximum density for TEMPERATURE_DIFF_REL is 1.25.
sns.histplot(data_icu['OXYGEN_SATURATION_DIFF_REL'], kde=True)
plt.show()
The maximum density for OXYGEN_SATURATION_DIFF_REL is 2.8.
sns.histplot(data_icu['ICU'], kde=True)
plt.show()
The maximum density for ICU is 3.
sns.histplot(data_icu['PATIENT_VISIT_IDENTIFIER'], kde=True)
plt.show()
a=data_icu['ICU'].value_counts()
plt.pie(a,labels = ['NON-ICU', 'ICU'])
plt.show()
According to the pie chart, more than half of the patients did not require ICU beds.
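The exact class shares behind the pie chart can be read off with value_counts(normalize=True); a minimal sketch on hypothetical labels (the series below is a stand-in for data_icu['ICU'], not the real column):

```python
import pandas as pd

# Hypothetical stand-in for data_icu['ICU'] (1 = needed an ICU bed)
icu = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# Normalized counts give each class's share of the patients directly
shares = icu.value_counts(normalize=True)
print(shares.loc[0])  # fraction of non-ICU rows
```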
Analyzing two variables simultaneously is known as bivariate analysis.
sns.boxplot(x="GENDER" , y="AGE_ABOVE65",data = data_icu)
<AxesSubplot:xlabel='GENDER', ylabel='AGE_ABOVE65'>
Men and women over the age of 65 are equally affected by COVID-19.
sns.boxplot(x="ICU", y="AGE_ABOVE65" , data = data_icu)
<AxesSubplot:xlabel='ICU', ylabel='AGE_ABOVE65'>
The proportion of patients above 65 was similar between the ICU and non-ICU groups.
age = sns.countplot(x='AGE_ABOVE65' , hue='GENDER' , data=data_icu)
for p in age.patches:
    height = p.get_height()
    age.text(p.get_x() + p.get_width()/2., height + 0.1, height, ha="center")
icu=sns.countplot(x="ICU", hue = "GENDER", data = data_icu)
for p in icu.patches:
    height = p.get_height()
    icu.text(p.get_x() + p.get_width()/2., height + 0.1, height, ha="center")
Compared to female patients, male COVID-19 patients were admitted at a higher rate.
DATA CLEANING
It is necessary to clean our data before running the ML algorithms.
drop_cols = ['TEMPERATURE_DIFF', 'OXYGEN_SATURATION_DIFF', 'BLOODPRESSURE_DIASTOLIC_DIFF_REL', 'BLOODPRESSURE_SISTOLIC_DIFF_REL']
data_icu
| PATIENT_VISIT_IDENTIFIER | AGE_ABOVE65 | AGE_PERCENTIL | GENDER | DISEASE GROUPING 1 | DISEASE GROUPING 2 | DISEASE GROUPING 3 | DISEASE GROUPING 4 | DISEASE GROUPING 5 | DISEASE GROUPING 6 | ... | TEMPERATURE_DIFF | OXYGEN_SATURATION_DIFF | BLOODPRESSURE_DIASTOLIC_DIFF_REL | BLOODPRESSURE_SISTOLIC_DIFF_REL | HEART_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | TEMPERATURE_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | WINDOW | ICU | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | 0 |
| 1 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 0 |
| 2 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 | 0 |
| 3 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12 | 0 |
| 4 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -0.238095 | -0.818182 | -0.389967 | 0.407558 | -0.230462 | 0.096774 | -0.242282 | -0.814433 | 13 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1920 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | 0 |
| 1921 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 0 |
| 1922 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 | 0 |
| 1923 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12 | 0 |
| 1924 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | -0.547619 | -0.838384 | -0.701863 | -0.585967 | -0.763868 | -0.612903 | -0.551337 | -0.835052 | 13 | 0 |
1925 rows × 231 columns
data = data_icu.fillna(0)
dataset = data.copy()
The fillna() function substitutes a given value for any NaN values; here we replaced them with 0.
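Zero-filling is one choice; scikit-learn's SimpleImputer (already imported at the top) offers alternatives such as median imputation, which can be gentler on skewed vital-sign columns. A small sketch on a toy frame, not the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 4.0]})

# strategy='constant' with fill_value=0 reproduces df.fillna(0)
const = SimpleImputer(strategy='constant', fill_value=0)
filled_zero = pd.DataFrame(const.fit_transform(df), columns=df.columns)

# strategy='median' fills each column with its own median instead
med = SimpleImputer(strategy='median')
filled_med = pd.DataFrame(med.fit_transform(df), columns=df.columns)
```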
import missingno as msno
msno.bar(data)
<AxesSubplot:>
We import the missingno library to visualize our data, plotting a bar graph of missing values after filling the NaN values.
An outlier is a data point that is distant from all other observations, i.e. one that lies outside the overall distribution of the dataset.
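One common way to flag such points is the IQR rule (anything beyond 1.5 × IQR from the quartiles); a minimal sketch on toy values, not the actual columns:

```python
import pandas as pd

s = pd.Series([-0.6, -0.5, -0.4, -0.3, -0.2, 5.0])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
outliers = s[mask]
```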
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=2, cols=4)
fig.add_trace(go.Box(y=data_icu['BLOODPRESSURE_SISTOLIC_MAX'],name='BLOODPRESSURE_SISTOLIC_MAX'),row=1,col=1)
fig.add_trace(go.Box(y=data_icu['HEART_RATE_MAX'],name='HEART_RATE_MAX'),row=1,col=2)
fig.add_trace(go.Box(y=data_icu['RESPIRATORY_RATE_MAX'],name='RESPIRATORY_RATE_MAX'),row=1,col=3)
fig.add_trace(go.Box(y=data_icu['TEMPERATURE_MAX'],name='TEMPERATURE_MAX'),row=1,col=4)
fig.add_trace(go.Box(y=data_icu['OXYGEN_SATURATION_MAX'],name='OXYGEN_SATURATION_MAX'),row=2,col=1)
fig.add_trace(go.Box(y=data_icu['BLOODPRESSURE_DIASTOLIC_DIFF'],name='BLOODPRESSURE_DIASTOLIC_DIFF'),row=2,col=2)
fig.add_trace(go.Box(y=data_icu['BLOODPRESSURE_SISTOLIC_DIFF'],name='BLOODPRESSURE_SISTOLIC_DIFF'),row=2,col=3)
fig.add_trace(go.Box(y=data_icu['HEART_RATE_DIFF'],name='HEART_RATE_DIFF'),row=2,col=4)
fig.show()
data_icu.BLOODPRESSURE_SISTOLIC_MAX.mean()
data_icu.BLOODPRESSURE_SISTOLIC_MAX.std()
data_icu.BLOODPRESSURE_SISTOLIC_MAX.describe()
count    1238.000000
mean       -0.398612
std         0.286796
min        -0.989189
25%        -0.578378
50%        -0.459459
75%        -0.243243
max         1.000000
Name: BLOODPRESSURE_SISTOLIC_MAX, dtype: float64
upper_limit = data.BLOODPRESSURE_SISTOLIC_MAX.mean() + 3*data_icu.BLOODPRESSURE_SISTOLIC_MAX.std()
upper_limit
0.6040340948294227
According to this result, the upper limit is approximately 0.60. (Note that this cell mixes the zero-filled data frame's mean with data_icu's std; using data_icu for both would give roughly 0.46.)
lower_limit = data_icu.BLOODPRESSURE_SISTOLIC_MAX.mean() - 3*data_icu.BLOODPRESSURE_SISTOLIC_MAX.std()
lower_limit
-1.2589994388123775
According to this result, the lower limit is approximately -1.26.
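The same three-sigma limits can then be used to drop extreme rows; a sketch on synthetic values (not the real column), where a single point beyond mean ± 3σ is removed:

```python
import pandas as pd

# Twenty typical readings plus one extreme value
s = pd.Series([0.0] * 20 + [10.0])
upper = s.mean() + 3 * s.std()
lower = s.mean() - 3 * s.std()

# Keep only rows inside the three-sigma band
trimmed = s[(s >= lower) & (s <= upper)]
```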
x=dataset.drop('ICU', axis=1)
x
| PATIENT_VISIT_IDENTIFIER | AGE_ABOVE65 | AGE_PERCENTIL | GENDER | DISEASE GROUPING 1 | DISEASE GROUPING 2 | DISEASE GROUPING 3 | DISEASE GROUPING 4 | DISEASE GROUPING 5 | DISEASE GROUPING 6 | ... | RESPIRATORY_RATE_DIFF | TEMPERATURE_DIFF | OXYGEN_SATURATION_DIFF | BLOODPRESSURE_DIASTOLIC_DIFF_REL | BLOODPRESSURE_SISTOLIC_DIFF_REL | HEART_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | TEMPERATURE_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | WINDOW | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2 |
| 1 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4 |
| 2 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 |
| 3 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 12 |
| 4 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.176471 | -0.238095 | -0.818182 | -0.389967 | 0.407558 | -0.230462 | 0.096774 | -0.242282 | -0.814433 | 13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1920 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2 |
| 1921 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4 |
| 1922 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 |
| 1923 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 12 |
| 1924 | 384 | 0 | 50 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | -0.647059 | -0.547619 | -0.838384 | -0.701863 | -0.585967 | -0.763868 | -0.612903 | -0.551337 | -0.835052 | 13 |
1925 rows × 230 columns
y=dataset['ICU']
y
0 0
1 0
2 0
3 0
4 1
..
1920 0
1921 0
1922 0
1923 0
1924 0
Name: ICU, Length: 1925, dtype: int64
We split our data into training and test sets for machine learning modelling.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=95)
x_train
| PATIENT_VISIT_IDENTIFIER | AGE_ABOVE65 | AGE_PERCENTIL | GENDER | DISEASE GROUPING 1 | DISEASE GROUPING 2 | DISEASE GROUPING 3 | DISEASE GROUPING 4 | DISEASE GROUPING 5 | DISEASE GROUPING 6 | ... | RESPIRATORY_RATE_DIFF | TEMPERATURE_DIFF | OXYGEN_SATURATION_DIFF | BLOODPRESSURE_DIASTOLIC_DIFF_REL | BLOODPRESSURE_SISTOLIC_DIFF_REL | HEART_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | TEMPERATURE_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | WINDOW | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 601 | 120 | 1 | 80 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 4 |
| 1836 | 367 | 1 | 90 | 1 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 4 |
| 1821 | 364 | 1 | 90 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 4 |
| 280 | 56 | 0 | 50 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 2 |
| 547 | 109 | 0 | 30 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 773 | 154 | 0 | 40 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 12 |
| 118 | 23 | 0 | 40 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | -0.761905 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.767053 | 0.0 | 12 |
| 1555 | 311 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 2 |
| 1321 | 264 | 1 | 60 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 4 |
| 1430 | 286 | 0 | 40 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 2 |
1540 rows × 230 columns
y_train
601 0
1836 0
1821 1
280 0
547 1
..
773 0
118 0
1555 0
1321 0
1430 1
Name: ICU, Length: 1540, dtype: int64
x_test
| PATIENT_VISIT_IDENTIFIER | AGE_ABOVE65 | AGE_PERCENTIL | GENDER | DISEASE GROUPING 1 | DISEASE GROUPING 2 | DISEASE GROUPING 3 | DISEASE GROUPING 4 | DISEASE GROUPING 5 | DISEASE GROUPING 6 | ... | RESPIRATORY_RATE_DIFF | TEMPERATURE_DIFF | OXYGEN_SATURATION_DIFF | BLOODPRESSURE_DIASTOLIC_DIFF_REL | BLOODPRESSURE_SISTOLIC_DIFF_REL | HEART_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | TEMPERATURE_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | WINDOW | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1647 | 329 | 1 | 90 | 0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 |
| 1607 | 321 | 1 | 60 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 |
| 1421 | 284 | 0 | 20 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4 |
| 1 | 0 | 1 | 60 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4 |
| 1220 | 244 | 1 | 60 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1563 | 312 | 0 | 20 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 12 |
| 1102 | 220 | 1 | 100 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 |
| 1680 | 336 | 0 | 10 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2 |
| 442 | 88 | 0 | 10 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 |
| 1096 | 219 | 1 | 100 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | -0.529412 | -0.928571 | -0.979798 | -0.97756 | -0.886472 | -0.989267 | -0.697442 | -0.929831 | -0.979601 | 4 |
385 rows × 230 columns
y_test
1647 1
1607 0
1421 0
1 0
1220 0
..
1563 0
1102 0
1680 0
442 0
1096 1
Name: ICU, Length: 385, dtype: int64
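Since only about a quarter of the rows are ICU-positive, a stratified split can keep the class ratio identical in train and test. A sketch with a synthetic 25%-positive target (stratify=y is the only change versus the split above; the data here is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'feature': range(100)})
y = pd.Series([1] * 25 + [0] * 75)  # 25% positive, for illustration

# stratify=y preserves the 25/75 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=95, stratify=y)
```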
The decision tree is one of the strongest and most popular algorithms. It is a supervised learning method that works with both categorical and continuous output variables.
from sklearn.tree import DecisionTreeClassifier, plot_tree
dtc=DecisionTreeClassifier()
dtc.fit(x_train, y_train)
DecisionTreeClassifier()
y_pred1=dtc.predict(x_test)
from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_pred1,y_test)
0.7688311688311689
The accuracy of this model is approximately 0.77.
confusion_matrix(y_pred1,y_test)
array([[242, 47],
[ 42, 54]], dtype=int64)
plt.figure(figsize=(40,40)) # set plot size (denoted in inches)
plot_tree(dtc, filled=True, fontsize=10)
plt.show()
Random forest draws a random sample from the training set, builds a decision tree on it, and gets a prediction; it repeats this for the assigned number of trees, takes a vote across the predictions, and returns the majority class (for classification) or the average (for regression).
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
rf_clf.fit(x_train, y_train)
RandomForestClassifier()
y_pred = rf_clf.predict(x_test)
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:' , metrics.mean_squared_error(y_pred,y_test))
print('Root Mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print('R-Squared', r2_score(y_pred, y_test))
Mean Absolute Error: 0.14545454545454545
Mean Squared Error: 0.14545454545454545
Root Mean Squared Error: 0.3813850356982369
R-Squared 0.09090909090909127
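For 0/1 labels every misclassification contributes exactly 1 to the absolute error, so the MAE here is just the error rate, i.e. MAE = 1 − accuracy (0.1455 = 1 − 0.8545 above). A quick check on a small made-up label pair:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])
y_hat = np.array([0, 1, 1, 1, 0, 0, 0, 1])  # two mistakes out of eight

mae = mean_absolute_error(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)
# For binary labels, mae equals 1 - acc
```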
from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_pred,y_test)
0.8545454545454545
The accuracy for this model is approximately 0.85.
confusion_matrix (y_pred,y_test)
array([[268, 40],
[ 16, 61]], dtype=int64)
plt.figure(figsize=(8,6))
plt.plot(y_test,y_test,color='deeppink')
plt.scatter(y_test,y_pred,color='dodgerblue')
plt.xlabel('Actual Target Value',fontsize=15)
plt.ylabel('Predicted Target Value',fontsize=15)
plt.title('Random Forest Classifier',fontsize=14)
plt.show()
XGBoost is used for supervised learning problems, where we use the training data (with multiple features) to predict a target variable.
from xgboost import XGBClassifier
Classifier = XGBClassifier()
Classifier.fit(x_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
grow_policy='depthwise', importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100,
n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, ...)
y_pred2=Classifier.predict(x_test)
accuracy_score(y_pred2,y_test)
0.8831168831168831
The accuracy for this model is 0.88
confusion_matrix(y_pred2,y_test)
array([[270, 31],
[ 14, 70]], dtype=int64)
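Precision and recall can be read straight from this matrix. Note the notebook passes (y_pred2, y_test) to confusion_matrix, so rows index predictions and columns index true labels; under that orientation TP = 70, FP = 14, FN = 31:

```python
# Confusion matrix as printed above: rows = predicted class, cols = true class
cm = [[270, 31],
      [14, 70]]

tp = cm[1][1]  # predicted ICU, truly ICU
fp = cm[1][0]  # predicted ICU, truly non-ICU
fn = cm[0][1]  # predicted non-ICU, truly ICU

precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

These reproduce the 0.833 precision and 0.693 recall reported for XGBoost in the comparison table further down.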
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges; however, it is mostly used for classification problems.
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
svm = SVC()
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001]
}
cv_svm = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5)
cv_svm.fit(x_train, y_train.values.ravel())
print("Support Vector Machines Model best params: ", cv_svm.best_params_)
# Training model with best params
best_params = cv_svm.best_params_
svm_best = SVC(random_state=42,
               C=best_params['C'],
               gamma=best_params['gamma'])
svm_best.fit(x_train, y_train)
y_pred = svm_best.predict(x_test)
# Evaluating the model
print("Accuracy for Support Vector Machines is : ", round(accuracy_score(y_test, y_pred), 2))
print("\n\nClassification report for Support Vector Machines:")
print(classification_report(y_test, y_pred))
print("\n\nConfusion matrix for Support Vector Machines:")
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True)
Support Vector Machines Model best params: {'C': 100, 'gamma': 0.001}
Accuracy for Support Vector Machines is : 0.83
Classification report for Support Vector Machines:
precision recall f1-score support
0 0.87 0.91 0.89 284
1 0.71 0.61 0.66 101
accuracy 0.83 385
macro avg 0.79 0.76 0.77 385
weighted avg 0.83 0.83 0.83 385
Confusion matrix for Support Vector Machines:
<AxesSubplot:>
The accuracy for this model is 0.83.
clfs = {"SVM": SVC(kernel='rbf', probability=True),
        "DecisionTree": DecisionTreeClassifier(),
        "RandomForest": RandomForestClassifier(),
        "XGBoost": XGBClassifier(verbosity=0)}
import math
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
def model_fit(clfs):
    fitted_model = {}
    model_result = pd.DataFrame()
    for model_name, model in clfs.items():
        model.fit(x_train, y_train)
        fitted_model.update({model_name: model})
        y_pred = model.predict(x_test)
        model_dict = {}
        model_dict['1.Algorithm'] = model_name
        model_dict['2.Accuracy'] = round(accuracy_score(y_test, y_pred), 3)
        model_dict['3.Precision'] = round(precision_score(y_test, y_pred, zero_division=0), 3)
        model_dict['4.Recall'] = round(recall_score(y_test, y_pred), 3)
        model_dict['5.F1'] = round(f1_score(y_test, y_pred), 3)
        model_dict['6.ROC'] = round(roc_auc_score(y_test, y_pred), 3)
        # frame.append is deprecated; concatenate a one-row frame instead
        model_result = pd.concat([model_result, pd.DataFrame([model_dict])], ignore_index=True)
    return fitted_model, model_result
fitted_model, model_result = model_fit(clfs)
model_result.sort_values(by=['2.Accuracy'],ascending=False)
| 1.Algorithm | 2.Accuracy | 3.Precision | 4.Recall | 5.F1 | 6.ROC | |
|---|---|---|---|---|---|---|
| 3 | XGBoost | 0.883 | 0.833 | 0.693 | 0.757 | 0.822 |
| 2 | RandomForest | 0.849 | 0.795 | 0.574 | 0.667 | 0.761 |
| 1 | DecisionTree | 0.764 | 0.553 | 0.515 | 0.533 | 0.683 |
| 0 | SVM | 0.738 | 0.000 | 0.000 | 0.000 | 0.500 |
model_ordered = []
weights = []
i=1
for model_name in model_result['1.Algorithm'][
        index_natsorted(model_result['2.Accuracy'], reverse=False)]:
    model_ordered.append([model_name, clfs.get(model_name)])
    weights.append(math.exp(i))
    i += 0.8
plt.plot(weights)
plt.show()
weights
[2.718281828459045, 6.0496474644129465, 13.463738035001692, 29.964100047397025]
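These exponential weights could then feed a weighted soft-voting ensemble; a minimal sketch using sklearn's VotingClassifier on synthetic data (the estimators and dataset here are illustrative, not the fitted models above):

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Same exp(1 + 0.8*k) ramp as the loop above, here for three models
w = [math.exp(1 + 0.8 * k) for k in range(3)]

vote = VotingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=0)),
                ('rf', RandomForestClassifier(random_state=0)),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft',  # weights scale each model's predicted probabilities
    weights=w)
vote.fit(X, y)
preds = vote.predict(X)
```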
The best algorithm for the Brazilian COVID-19 dataset is XGBoost, which provides the best performance.
Introduction In this model, we utilize the Hospital Sírio-Libanês dataset to determine whether a COVID-19 patient needs an ICU bed. A machine learning strategy was used to maximize the usage of ICU beds, since there were not enough of them during the COVID-19 pandemic crisis in Brazil. Therefore, depending on the patient's present medical status, we use the dataset to classify whether the patient will need an ICU bed.
Basically, the ML workflow is divided into two parts:
1 EDA – data quality issues, univariate and bivariate analysis, data preparation (data cleaning).
2 ML algorithms – Decision Tree, Random Forest, XGBoost, and SVM; modeling and model evaluation.
Exploratory Data Analysis (EDA) In order to recognize the inconsistencies and missing values in the dataset, we performed exploratory data analysis on it. Some findings: the dataset contained 1925 rows and 231 columns, with some missing values. We found that some columns were strings (AGE_PERCENTIL and WINDOW), so we converted them to numeric form. We plotted some graphs for better visualization: bar plot, heat map, and pivot table. Univariate exploration – "uni" means "one", so univariate analysis refers to studying only one variable at a time. For univariate observation we used distribution plots, showing each variable's data against its density, to understand the data distribution. Bivariate exploration – analysing two variables simultaneously is known as bivariate analysis. We plotted graphs of gender against age above 65 and of ICU against age, to understand the distribution of ICU beds.
Data preparation: we cleaned the data for our machine-learning modelling using the fillna method, which replaces NaN values with 0, and then plotted a graph to verify the NaN values were gone after the fillna step.
Identified outliers: an outlier is a data point that is distant from all other observations, lying outside the overall distribution of the dataset. We plotted box plots for several variables and checked the upper and lower limits for BLOODPRESSURE_SISTOLIC_MAX.
Machine learning (ML): to run the ML algorithms we first split the data into x and y, then into train and test sets; 80% of the dataset was used to train the models and the remaining 20% for model assessment (test).
The sklearn machine learning library has been used to generate machine learning classification models. The models we will design are:
1. Decision Tree Classifier: the decision tree model is slightly less accurate; the accuracy we get is approximately 0.77. We plotted the tree and showed the confusion matrix as well.
2. Random Forest: the random forest model improves on the decision tree, with an accuracy of about 0.85.
3. XGBoost: the XGBoost model shows the highest accuracy of all the models, 0.88.
4. Support Vector Machine: the SVM model predicts somewhat more accurately for patients who do not require ICU beds; its overall accuracy is 0.83.
Modeling and Model Evaluation Ensemble Voting Method
We performed an ensemble voting comparison, presenting the models and their metrics in a table and a graph.
Conclusion: in this project we were able to use machine learning to determine whether a patient would need an ICU bed upon hospital admission, achieving a best accuracy of 88% in predicting whether an admitted patient would eventually require the ICU. To improve the model's accuracy and predictions, we may collect additional data and aim for even greater accuracy in future work.